library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr)
First, in order to make the problem/dataset more interesting, we categorize the alcohol variable (which is initially numerical) into three levels: low, medium, and high. The separation rule is as follows:
Low (levels.alcohol == 1) when alcohol is below
\(10.5 \%\).
Medium (levels.alcohol == 2) when alcohol is below
\(12 \%\).
High (levels.alcohol == 3) when alcohol is above
\(12 \%\).
wineDataset <- read.csv("/Users/philippebeliveau/Downloads/winequality-red.csv")
alcohol_levels <- wineDataset %>% select(alcohol) %>% mutate("Low_alcohol" = alcohol<=10.5, "normal_alcohol" = alcohol>10.5 & alcohol<=12,
"high_alcohol"= alcohol>12)
alcohol_levels$Low_alcohol <- as.integer(as.logical(alcohol_levels$Low_alcohol))
alcohol_levels$normal_alcohol <- as.integer(as.logical(alcohol_levels$normal_alcohol))
alcohol_levels$high_alcohol <- as.integer(as.logical(alcohol_levels$high_alcohol))
alcohol_levels <- alcohol_levels[, -1]
wineDataset <- cbind(wineDataset, alcohol_levels)
wine_t <- wineDataset %>% mutate(levels.alcohol=
case_when(Low_alcohol==1~1, normal_alcohol==1~2, high_alcohol==1~3))
wineDataset <- wine_t[, c(1:10, 12, 16)]
We’ll start off by exploring the data…
wineDataset <- wineDataset[complete.cases(wineDataset), ]
str(wineDataset)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ levels.alcohol : num 1 1 1 1 1 1 1 1 1 1 ...
head(wineDataset)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates quality
## 1 11 34 0.9978 3.51 0.56 5
## 2 25 67 0.9968 3.20 0.68 5
## 3 15 54 0.9970 3.26 0.65 5
## 4 17 60 0.9980 3.16 0.58 6
## 5 11 34 0.9978 3.51 0.56 5
## 6 13 40 0.9978 3.51 0.56 5
## levels.alcohol
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
colSums(is.na(wineDataset))
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates quality levels.alcohol
## 0 0 0
Comment: There are several observations:
The dataset contains 1599 observations.
The dataset has 11 explanatory variables and doesn’t have any NA values.
All variables are continuous variables, except the level of alcohol, which we will treat as a categorical variable with three distinct categories.
Some descriptive statistics:
summary<-sapply(wineDataset,function(x) c(mean(x),sd(x),min(x),max(x),length(x)))
row.names(summary)<-c("mean","sd","min","max","n")
summary
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## mean 8.319637 0.5278205 0.2709756 2.538806 8.746654e-02
## sd 1.741096 0.1790597 0.1948011 1.409928 4.706530e-02
## min 4.600000 0.1200000 0.0000000 0.900000 1.200000e-02
## max 15.900000 1.5800000 1.0000000 15.500000 6.110000e-01
## n 1599.000000 1599.0000000 1599.0000000 1599.000000 1.599000e+03
## free.sulfur.dioxide total.sulfur.dioxide density pH
## mean 15.87492 46.46779 9.967467e-01 3.3111132
## sd 10.46016 32.89532 1.887334e-03 0.1543865
## min 1.00000 6.00000 9.900700e-01 2.7400000
## max 72.00000 289.00000 1.003690e+00 4.0100000
## n 1599.00000 1599.00000 1.599000e+03 1599.0000000
## sulphates quality levels.alcohol
## mean 0.6581488 5.6360225 1.4734209
## sd 0.1695070 0.8075694 0.6526256
## min 0.3300000 3.0000000 1.0000000
## max 2.0000000 8.0000000 3.0000000
## n 1599.0000000 1599.0000000 1599.0000000
Comment: We notice that variables have different
scales. Certain variables such as free.sulfur.dioxide and
total.sulfur.dioxide have a large standard deviation,
whereas others like density have very small standard
deviation.
We are looking for features that hold significant amount of information about our response variable. Thus, we are especially looking for features with high standard deviation. free.sulfur.dioxide and total.sulfur.dioxide seems to be good candidates for helping us explain the quality of the wine.
# Histograms
par(mfrow=c(2,2))
hist(wineDataset$fixed.acidity,xlab="fixed acidity",main="Histogram")
hist(wineDataset$volatile.acidity,xlab="volatile acidity",main="Histogram")
hist(wineDataset$citric.acid,xlab="citric acid",main="Histogram")
hist(wineDataset$residual.sugar,xlab="residual sugar",main="Histogram")
par(mfrow=c(2,2))
hist(wineDataset$chlorides,xlab="chlorides",main="Histogram")
hist(wineDataset$free.sulfur.dioxide,xlab="free.sulfur.dioxide",main="Histogram")
hist(wineDataset$total.sulfur.dioxide,xlab="total.sulfur.dioxide",main="Histogram")
hist(wineDataset$density,xlab="density",main="Histogram")
par(mfrow=c(2,2))
hist(wineDataset$pH,xlab="pH",main="Histogram")
hist(wineDataset$sulphates,xlab="sulphates",main="Histogram")
hist(wineDataset$levels.alcohol,xlab="alcohol",main="Histogram")
hist(wineDataset$quality,xlab="quality",main="Histogram")
Comment: Some observations include:
We see that the majority of our explanatory variables are slightly skewed to the right. We could then consider using a different correlation coefficient that doesn’t require normality between the explanatory variable to assess the correlation between them.
We also see that our response variable is relatively normally distributed. It is expected; low and high quality wines are less observed, while average quality wines are the most frequent. There are more than 68.9% of the observations that are within one standard deviation, thus the distribution is not perfectly normally distributed.
Another interesting observation includes the imbalance of the levels of alcohol. We see that most wines have a low level of alcohol. It could be interesting to look into more details at the characteristic that compose higher level of alcohol.
We can also examine the relationship between the response variable and the explanatory variables using scatterplots.
par(mfrow=c(1,2))
# quality, fixed.acidity
plot(quality~fixed.acidity,xlab="fixed acidity",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~fixed.acidity,data=wineDataset),lwd=1.5)
# quality, volatile.acidity
plot(quality~volatile.acidity,xlab="volatile acidity",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~volatile.acidity,data=wineDataset),lwd=1.5)
par(mfrow=c(1,2))
# quality, citric.acid
plot(quality~citric.acid,xlab="citric acid",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~citric.acid,data=wineDataset),lwd=1.5)
# quality, pH
plot(quality~pH,xlab="pH",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~pH,data=wineDataset),lwd=1.5)
par(mfrow=c(1,2))
# quality, residual.sugar
plot(quality~residual.sugar,xlab="residual sugar",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~residual.sugar,data=wineDataset),lwd=1.5)
# quality, chlorides
plot(quality~chlorides,xlab="chlorides",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~chlorides,data=wineDataset),lwd=1.5)
par(mfrow=c(1,2))
# quality, free.sulfur.dioxide
plot(quality~free.sulfur.dioxide,xlab="free sulfur dioxide",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~free.sulfur.dioxide,data=wineDataset),lwd=1.5)
# quality, total.sulfur.dioxide
plot(quality~total.sulfur.dioxide,xlab="total sulfur dioxide",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~total.sulfur.dioxide,data=wineDataset),lwd=1.5)
par(mfrow=c(1,2))
# quality, density
plot(quality~density,xlab="density",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~density,data=wineDataset),lwd=1.5)
par(mfrow=c(1,2))
# quality, sulphates
plot(quality~sulphates,xlab="sulphates",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~sulphates,data=wineDataset),lwd=1.5)
Comment: Highlights from this inspection include:
There seems to have a strong negative correlation between the volatile acidity and the quality of the wine. Meaning that low quality wine have significantly higher volatile acidity level than good quality wine.
We also see that having higher levels of chloride seems to be associated with lower quality of wine.
The graph of quality and density seems interesting, as the value of density seems to be able to distinguish between low quality wine (3,5) to decent and very good wine (6,8). This make us believe that the level of density might be a good candidate to distinguish the wine quality.
Conclusion: These graphs only tell a limited story, as they examine the relationship between quality and one explanatory variable at a time.By looking at these graphs, it is not possible to deduct how these variables effect quality of wine together.
From the EDA, we observe that some variables might have outliers (e.g., total sulfur dioxide). It is therefore critical to treat them before modeling. We will use the common method of capping the lowest and the highest \(5\%\) of the observations for all the numerical variables.
library(dlookr)
## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 2 > 1' in coercion to 'logical(1)'
##
## Attaching package: 'dlookr'
## The following object is masked from 'package:magrittr':
##
## extract
## The following object is masked from 'package:base':
##
## transform
wineDataset$fixed.acidity <- imputate_outlier(wineDataset, fixed.acidity, method = "capping")
wineDataset$volatile.acidity <- imputate_outlier(wineDataset, volatile.acidity, method = "capping")
wineDataset$citric.acid <- imputate_outlier(wineDataset, citric.acid, method = "capping")
wineDataset$pH <- imputate_outlier(wineDataset, pH, method = "capping")
wineDataset$residual.sugar <- imputate_outlier(wineDataset, residual.sugar, method = "capping")
wineDataset$chlorides <- imputate_outlier(wineDataset, chlorides, method = "capping")
wineDataset$free.sulfur.dioxide <- imputate_outlier(wineDataset, free.sulfur.dioxide, method = "capping")
wineDataset$total.sulfur.dioxide <- imputate_outlier(wineDataset, total.sulfur.dioxide, method = "capping")
wineDataset$density <- imputate_outlier(wineDataset, density, method = "capping")
wineDataset$sulphates <- imputate_outlier(wineDataset, sulphates, method = "capping")
Comment: Note that the default capping value of the
imputate_outlier function is 95%, therefore we don’t need
to specify the capping value here.
library("corrplot")
## corrplot 0.92 loaded
correlations <- cor(wineDataset)
corrplot(correlations, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
corrplot(correlations, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45, method="number", number.cex=0.60)
Comment: From the graph, we see that the pH level and the fixed acidity are strongly negatively correlated, while the fixed acidity and density are strongly positively correlated, with correlation coefficients of -0.68 and 0.67, respectively. Moreover, free sulfur dioxide and total sulfur dioxide have a correlation coefficient of 0.67, meaning they have a strong positive correlation.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
mod.col1<-lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+as.factor(levels.alcohol),data=wineDataset)
summary(mod.col1)
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + as.factor(levels.alcohol), data = wineDataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.77378 -0.37303 -0.04796 0.44652 2.02834
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.865e+01 2.040e+01 3.365 0.000783 ***
## fixed.acidity 7.763e-02 2.484e-02 3.125 0.001808 **
## volatile.acidity -1.020e+00 1.294e-01 -7.878 6.12e-15 ***
## citric.acid -2.722e-01 1.463e-01 -1.861 0.062908 .
## residual.sugar 3.796e-02 2.066e-02 1.837 0.066332 .
## chlorides -3.347e+00 1.001e+00 -3.344 0.000847 ***
## free.sulfur.dioxide 4.344e-03 2.477e-03 1.754 0.079707 .
## total.sulfur.dioxide -3.362e-03 8.591e-04 -3.913 9.49e-05 ***
## density -6.349e+01 2.097e+01 -3.027 0.002507 **
## pH -1.865e-01 1.889e-01 -0.987 0.323612
## sulphates 1.479e+00 1.400e-01 10.566 < 2e-16 ***
## as.factor(levels.alcohol)2 3.238e-01 4.818e-02 6.721 2.50e-11 ***
## as.factor(levels.alcohol)3 7.259e-01 8.148e-02 8.909 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6499 on 1586 degrees of freedom
## Multiple R-squared: 0.3573, Adjusted R-squared: 0.3525
## F-statistic: 73.48 on 12 and 1586 DF, p-value: < 2.2e-16
vif(mod.col1)
## GVIF Df GVIF^(1/(2*Df))
## fixed.acidity 6.102908 1 2.470407
## volatile.acidity 1.813828 1 1.346784
## citric.acid 3.049953 1 1.746411
## residual.sugar 1.500429 1 1.224920
## chlorides 1.308109 1 1.143726
## free.sulfur.dioxide 2.107918 1 1.451867
## total.sulfur.dioxide 2.440603 1 1.562243
## density 4.974059 1 2.230260
## pH 2.799577 1 1.673194
## sulphates 1.272692 1 1.128137
## as.factor(levels.alcohol) 2.494696 2 1.256766
Comment: From the VIF table, we observe that fixed acidity has the highest VIF of 6.10, followed by density with a VIF of 4.97. We will therefore proceed with a model “without” the fixed acidity variable and see if the VIF values decrease.
mod.col2<-lm(quality~volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+as.factor(levels.alcohol),data=wineDataset)
summary(mod.col2)
##
## Call:
## lm(formula = quality ~ volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + as.factor(levels.alcohol), data = wineDataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.65517 -0.38253 -0.04588 0.44502 2.05982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.343e+01 1.442e+01 1.625 0.104387
## volatile.acidity -1.002e+00 1.297e-01 -7.729 1.91e-14 ***
## citric.acid -1.210e-01 1.384e-01 -0.874 0.382017
## residual.sugar 1.955e-02 1.986e-02 0.984 0.325044
## chlorides -3.783e+00 9.941e-01 -3.806 0.000147 ***
## free.sulfur.dioxide 5.117e-03 2.472e-03 2.070 0.038586 *
## total.sulfur.dioxide -4.107e-03 8.276e-04 -4.963 7.69e-07 ***
## density -1.614e+01 1.454e+01 -1.110 0.267209
## pH -5.738e-01 1.429e-01 -4.016 6.21e-05 ***
## sulphates 1.437e+00 1.397e-01 10.281 < 2e-16 ***
## as.factor(levels.alcohol)2 3.771e-01 4.519e-02 8.344 < 2e-16 ***
## as.factor(levels.alcohol)3 8.165e-01 7.635e-02 10.693 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6516 on 1587 degrees of freedom
## Multiple R-squared: 0.3534, Adjusted R-squared: 0.3489
## F-statistic: 78.84 on 11 and 1587 DF, p-value: < 2.2e-16
vif(mod.col2)
## GVIF Df GVIF^(1/(2*Df))
## volatile.acidity 1.810440 1 1.345526
## citric.acid 2.716333 1 1.648130
## residual.sugar 1.378407 1 1.174056
## chlorides 1.282706 1 1.132566
## free.sulfur.dioxide 2.086896 1 1.444609
## total.sulfur.dioxide 2.252512 1 1.500837
## density 2.378154 1 1.542126
## pH 1.593939 1 1.262513
## sulphates 1.260527 1 1.122732
## as.factor(levels.alcohol) 2.088804 2 1.202194
Comment: We see that when the fixed acidity variable is excluded from the model, all the VIF values are below a reasonable threshold of 3. Based on the VIF values, we remove fixed acidity.
Comment: We see that the slopes for different levels of alcohol are not the same, which could be a sign of interaction between alcohol and density. More specifically, line 1 and line 3 are not parallel, i.e., the effect of the level of alcohol on quality changes as a function of density. We see that the higher the value of density, the smaller the difference in quality of wine between low and high levels of alcohol. Although this interaction looks weak.
The linear regression is of the form:
\[ \begin{align} Y = &\beta_{0} + \beta_{1}volatile.acidity + \beta_{2}citric.acid + \beta_{3}residual.sugar + \beta_{4}chlorides + \beta_{5}free.sulfur.dioxide + \\ &\beta_{6}total.sulfur.dioxide + \beta_{7}density + \beta_{8}pH + \beta_{9}sulphates + \beta_{10}medium.alcohol + \beta_{11}high.alcohol + \\ & \beta_{12}(medium.alcohol*density) + \beta_{13}(high.alcohol*density) + \epsilon \end{align} \]
mod<-lm(quality~volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+as.factor(levels.alcohol)+as.factor(levels.alcohol)*density,data=wineDataset)
summary(mod)
##
## Call:
## lm(formula = quality ~ volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + as.factor(levels.alcohol) + as.factor(levels.alcohol) *
## density, data = wineDataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6574 -0.3801 -0.0422 0.4469 2.0606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.116e+01 1.787e+01 1.184 0.236674
## volatile.acidity -1.001e+00 1.299e-01 -7.703 2.33e-14 ***
## citric.acid -1.208e-01 1.387e-01 -0.871 0.384070
## residual.sugar 1.958e-02 1.995e-02 0.981 0.326542
## chlorides -3.766e+00 9.986e-01 -3.772 0.000168 ***
## free.sulfur.dioxide 5.110e-03 2.476e-03 2.064 0.039182 *
## total.sulfur.dioxide -4.113e-03 8.305e-04 -4.953 8.10e-07 ***
## density -1.385e+01 1.801e+01 -0.769 0.441777
## pH -5.759e-01 1.434e-01 -4.017 6.17e-05 ***
## sulphates 1.435e+00 1.400e-01 10.255 < 2e-16 ***
## as.factor(levels.alcohol)2 4.967e+00 2.330e+01 0.213 0.831242
## as.factor(levels.alcohol)3 5.978e+00 3.485e+01 0.172 0.863837
## density:as.factor(levels.alcohol)2 -4.604e+00 2.338e+01 -0.197 0.843891
## density:as.factor(levels.alcohol)3 -5.183e+00 3.502e+01 -0.148 0.882383
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.652 on 1585 degrees of freedom
## Multiple R-squared: 0.3534, Adjusted R-squared: 0.3481
## F-statistic: 66.63 on 13 and 1585 DF, p-value: < 2.2e-16
The fitted model has the form:
\[ \begin{align} \hat{quality} = &(2.116e+01) - (1.001e+00)volatile.acidity -(1.208e-01)citric.acid + \\ & (1.958e-02)residual.sugar - (3.766e+00)chlorides + (5.110e-03)free.sulfur.dioxide - \\ &(4.113e-03)total.sulfur.dioxide -(1.385e+01)density -(5.759e-01)pH + \\ & (1.435e+00)sulphates + (4.967e+00)medium.alcohol + (5.978e+00)high.alcohol - \\ &(4.604e+00)medium.alcohol*density - (5.183e+00)high.alcohol*density \end{align} \]
We can split up the model based on the level of alcohol:
When the level of alcohol is \(low\):
\[ \begin{align} &\hat{E}(quality|alcohol=low) = (2.116e+01) - (1.001e+00)volatile.acidity - \\ &(1.208e-01)citric.acid + (1.958e-02)residual.sugar - (3.766e+00)chlorides + \\ & (5.110e-03)free.sulfur.dioxide -(4.113e-03)total.sulfur.dioxide -(1.385e+01)density - \\ & (5.759e-01)pH + (1.435e+00)sulphates \end{align} \]
When the level of alcohol is \(medium\):
\[ \begin{align} &\hat{E}(quality|alcohol=medium) = (2.116e+01 + 4.967e+00) - (1.001e+00)volatile.acidity - \\ &(1.208e-01)citric.acid + (1.958e-02)residual.sugar - (3.766e+00)chlorides + \\ & (5.110e-03)free.sulfur.dioxide -(4.113e-03)total.sulfur.dioxide - \\ & (1.385e+01+4.604e+00) density -(5.759e-01)pH + (1.435e+00)sulphates \\ \\ & = 2.613e+01 - (1.001e+00)volatile.acidity -(1.208e-01)citric.acid + \\ & (1.958e-02)residual.sugar - (3.766e+00)chlorides (5.110e-03)free.sulfur.dioxide - \\ &(4.113e-03)total.sulfur.dioxide - 18.454 density -(5.759e-01)pH + (1.435e+00)sulphates \end{align} \]
When the level of alcohol is \(high\):
\[ \begin{align} &\hat{E}(quality|alcohol=high) = (2.116e+01 + 5.978e+00) - (1.001e+00)volatile.acidity - \\ &(1.208e-01)citric.acid + (1.958e-02)residual.sugar - (3.766e+00)chlorides + \\ & (5.110e-03)free.sulfur.dioxide -(4.113e-03)total.sulfur.dioxide - \\ & (1.385e+01+ 5.183e+00)density -(5.759e-01)pH + (1.435e+00)sulphates \\ \\ & = 2.714e+01 - (1.001e+00)volatile.acidity -(1.208e-01)citric.acid + \\ & (1.958e-02)residual.sugar - (3.766e+00)chlorides + (5.110e-03)free.sulfur.dioxide - \\ & (4.113e-03)total.sulfur.dioxide - 19.033density -(5.759e-01)pH + (1.435e+00)sulphates \end{align} \]
\(R^2\) is 0.3534, meaning that all the explanatory variables in the model and the interaction variables can explain 36.15% of the variability in the response variable (quality).
Adjusted \(R^2\) = 0.3481.
Interpreting the regression coefficient associated with residual sugar (\(\beta_3\)):
\[ \begin{align} &\beta_3 = E(quality|residual sugar = x + 1) - \\ &E(quality|residual sugar = x) \end{align} \]
On average, one unit increase in residual sugar parameter causes quality of wine to increase by 1.958e-02, holding all other variables constant.
Interpreting the regression coefficient associated with the main effect of density.(\(\beta_7\)):
\[ \begin{align} &\beta_7 = E(quality|density = x + 1, alcohol=low) - \\ &E(quality|density = x,new,alcohol=low) \end{align} \]
On average , one unit increase in density causes quality of wine with low alcohol to change by -(1.385e+01) holding all other variables constant. In other words, on average, one unit increase in density causes quality of wine to decrease by 1.385e+01, when alcohol = low, holding all other variables constant.
Here we are interested in testing the hypotheses:
\(H_0: \beta_{12} = \beta_{13} =0\) vs. \(H_1:\) at least one of them \(\neq 0\)
To test the hypothesis, we proceed to build a new model called mod2, which doesn’t contain the interaction variables \(medium.alcohol*density\) and \(high.alcohol*density\).
mod2 <- lm(quality~volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+as.factor(levels.alcohol),data=wineDataset)
We then compare mod and mod2 using anova function.
anova(mod, mod2)
## Analysis of Variance Table
##
## Model 1: quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide + density + pH +
## sulphates + as.factor(levels.alcohol) + as.factor(levels.alcohol) *
## density
## Model 2: quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide + density + pH +
## sulphates + as.factor(levels.alcohol)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1585 673.89
## 2 1587 673.91 -2 -0.020233 0.0238 0.9765
This leads to a test statistic value (F-test) of 0.0238 and a p-value of 0.97. For the reasonable significance level of \(\alpha=0.05\), we fail to reject \(H_0\) and conclude that the interaction between alcohol and density is not significant, that is, the effect of density on the quality of wine does not depend on the level of alcohol, and vice-versa, keeping all other variables fixed.
Here we are interested in testing the hypotheses:
\(H_0: \beta_{10} = \beta_{11} = \beta_{12} = \beta_{13} =0\) vs. \(H_1:\) at least one of them \(\neq 0\)
To test this hypothesis, we build a new model mod3, which excludes the variables \(medium.alcohol\), \(high.alcohol\), \(medium.alcohol*density\), and \(high.alcohol*density\).
mod3 <- lm(quality~volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates,data=wineDataset)
We then compare mod and mod3 using anova function.
anova(mod, mod3)
## Analysis of Variance Table
##
## Model 1: quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide + density + pH +
## sulphates + as.factor(levels.alcohol) + as.factor(levels.alcohol) *
## density
## Model 2: quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide + density + pH +
## sulphates
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1585 673.89
## 2 1589 728.12 -4 -54.228 31.886 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This leads to a test statistic value (F-test) of 31.886 and a p-value of 2.2e-16. For the significance level of \(\alpha=0.01\), we reject \(H_0\) and conclude that the variable alcohol is globally significant.
In order to compare the high level of alcohol with medium level of alcohol, we need to relevel the \(alcohol\) variable and build the model again with the re-leveled version of \(alcohol\).
wineDataset$levels.alcohol <-relevel(as.factor(wineDataset$levels.alcohol),2)
levels(as.factor(wineDataset$levels.alcohol))
## [1] "2" "1" "3"
Now that \(medium.alcohol\) is the reference level, we can construct the model again.
mod4<-lm(quality~volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+as.factor(levels.alcohol)+as.factor(levels.alcohol)*density,data=wineDataset)
summary(mod4)
##
## Call:
## lm(formula = quality ~ volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + as.factor(levels.alcohol) + as.factor(levels.alcohol) *
## density, data = wineDataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6574 -0.3801 -0.0422 0.4469 2.0606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.613e+01 1.976e+01 1.322 0.186300
## volatile.acidity -1.001e+00 1.299e-01 -7.703 2.33e-14 ***
## citric.acid -1.208e-01 1.387e-01 -0.871 0.384070
## residual.sugar 1.958e-02 1.995e-02 0.981 0.326542
## chlorides -3.766e+00 9.986e-01 -3.772 0.000168 ***
## free.sulfur.dioxide 5.110e-03 2.476e-03 2.064 0.039182 *
## total.sulfur.dioxide -4.113e-03 8.305e-04 -4.953 8.10e-07 ***
## density -1.846e+01 1.987e+01 -0.929 0.353128
## pH -5.759e-01 1.434e-01 -4.017 6.17e-05 ***
## sulphates 1.435e+00 1.400e-01 10.255 < 2e-16 ***
## as.factor(levels.alcohol)1 -4.967e+00 2.330e+01 -0.213 0.831242
## as.factor(levels.alcohol)3 1.011e+00 3.570e+01 0.028 0.977406
## density:as.factor(levels.alcohol)1 4.604e+00 2.338e+01 0.197 0.843891
## density:as.factor(levels.alcohol)3 -5.784e-01 3.589e+01 -0.016 0.987143
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.652 on 1585 degrees of freedom
## Multiple R-squared: 0.3534, Adjusted R-squared: 0.3481
## F-statistic: 66.63 on 13 and 1585 DF, p-value: < 2.2e-16
We are interested in testing the hypothesis:
\(H_0: \beta_{13} =0\) vs. \(H_1: \beta_{13} \neq 0\).
This leads to a test statistic value (t-value) of -0.016 and a p-value of 0.987143. For the reasonable significance level of \(\alpha=0.05\), we fail to reject \(H_0\) and conclude that the there is not a significant difference in the effect of density on the quality of wine with a high level of alcohol in comparison to wine with medium level of alcohol.
In order to asses if the model is well specified and that the t-tests are valid, we make sure that the assumption of normality is met. To do that, we will looking at the residuals.
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates quality
## 1 11 34 0.9978 3.51 0.56 5
## 2 25 67 0.9968 3.20 0.68 5
## 3 15 54 0.9970 3.26 0.65 5
## 4 17 60 0.9980 3.16 0.58 6
## 5 11 34 0.9978 3.51 0.56 5
## 6 13 40 0.9978 3.51 0.56 5
## levels.alcohol fitted resid
## 1 1 5.086284 -0.1325670
## 2 1 5.137421 -0.2114159
## 3 1 5.191389 -0.2939047
## 4 1 5.593930 0.6240677
## 5 1 5.086284 -0.1325670
## 6 1 5.113659 -0.1746367
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
Comment: Here it seems that the residuals are normally distributed due to the bell shape curve of the graph. However, we see that the residuals do not have exactly a mean of 0, which could lead us to think that the residuals are not fully normal. Let’s investigate this comment more in details with other residuals graph.
Comment: The residuals fall on average on the line, although we can clearly see a deviation of the residuals in the quantile range (0, -2). As most of the observation are not contained in that range, we can assume that the residuals follow the assumption of normality.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Comment: When looking at the fitted residuals of volatile acidity on the response variable, we see that the assumptions of normality is being respected. The residuals do not show any form of heteroscedasticity (tunnel shape). The residuals seems to stay around their mean.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Comment: Here I can say the same thing as the previous graph, the residuals look pretty normal and constant variance around the mean.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Comment: In terms of residual sugar, I assume that the assumption of normality is still being respected. Most of the residuals are in the same range of value, even though we see extreme value of residuals, they count for very few observations.
Comment: The boxplots seem okay and suggest no significant abnormalities, we could argue that the residuals for group = 1 are not completely following the assumptions of normality, but we believe that it is not significant enough to violate the assumptions of normality across the group (because the mean residuals value is still very close to zero). In short, there are no huge differences in the mean and variances across all the levels, even though the residuals of levels of alcohol for group = 1 doesn’t have a mean residuals of 0.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Comment: Finally, The residual graph concerning the fitted values is very well fitted and looks linear.
Conclusion: In our residual analysis, we plotted a histogram of residuals, a qqplot, residuals vs. covariates, and residuals vs. fitted values graphs. Overall, the histogram and qqplot showed that normality assumption of residuals is satisfied. Furthermore, residuals vs. covariates, and residuals vs. fitted values graphs showed that the relationships are indeed linear and that the assumption of constant variance (that is variance of residuals is constant for all values of independent variables) seems reasonable here. Therefore, the model seems to be well specified.
# setwd("/Users/siddhantarora/Desktop/HEC/Fall 2022/Stats Modelling/Final Project")
# df_bank = read.csv("/Users/siddhantarora/Desktop/HEC/Fall 2022/Stats Modelling/Final Project/bank.csv",
# sep = ";", header = TRUE)
df_bank = read.csv("/Users/philippebeliveau/Downloads/bank.csv",
sep = ";", header = TRUE)
Part a)
Perform a comprehensive exploratory data analysis (EDA) to obtain a better understanding of the dataset and perform necessary manipulations, if needed. Comment.
summary(df_bank)
## age job marital education
## Min. :19.00 Length:4521 Length:4521 Length:4521
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :41.17
## 3rd Qu.:49.00
## Max. :87.00
## default balance housing loan
## Length:4521 Min. :-3313 Length:4521 Length:4521
## Class :character 1st Qu.: 69 Class :character Class :character
## Mode :character Median : 444 Mode :character Mode :character
## Mean : 1423
## 3rd Qu.: 1480
## Max. :71188
## contact day month duration
## Length:4521 Min. : 1.00 Length:4521 Min. : 4
## Class :character 1st Qu.: 9.00 Class :character 1st Qu.: 104
## Mode :character Median :16.00 Mode :character Median : 185
## Mean :15.92 Mean : 264
## 3rd Qu.:21.00 3rd Qu.: 329
## Max. :31.00 Max. :3025
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.00 Min. : 0.0000 Length:4521
## 1st Qu.: 1.000 1st Qu.: -1.00 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.00 Median : 0.0000 Mode :character
## Mean : 2.794 Mean : 39.77 Mean : 0.5426
## 3rd Qu.: 3.000 3rd Qu.: -1.00 3rd Qu.: 0.0000
## Max. :50.000 Max. :871.00 Max. :25.0000
## y
## Length:4521
## Class :character
## Mode :character
##
##
##
sum(is.na(df_bank))
## [1] 0
dim(df_bank)
## [1] 4521 17
str(df_bank)
## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : chr "unemployed" "services" "management" "management" ...
## $ marital : chr "married" "married" "single" "married" ...
## $ education: chr "primary" "secondary" "tertiary" "tertiary" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : chr "no" "yes" "yes" "yes" ...
## $ loan : chr "no" "yes" "no" "yes" ...
## $ contact : chr "cellular" "cellular" "cellular" "unknown" ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : chr "oct" "may" "apr" "jun" ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : chr "unknown" "failure" "failure" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
head(df_bank)
## age job marital education default balance housing loan contact day
## 1 30 unemployed married primary no 1787 no no cellular 19
## 2 33 services married secondary no 4789 yes yes cellular 11
## 3 35 management single tertiary no 1350 yes no cellular 16
## 4 30 management married tertiary no 1476 yes yes unknown 3
## 5 59 blue-collar married secondary no 0 yes no unknown 5
## 6 35 management single tertiary no 747 no no cellular 23
## month duration campaign pdays previous poutcome y
## 1 oct 79 1 -1 0 unknown no
## 2 may 220 1 339 4 failure no
## 3 apr 185 1 330 1 failure no
## 4 jun 199 4 -1 0 unknown no
## 5 may 226 1 -1 0 unknown no
## 6 feb 141 2 176 3 failure no
#Converting character values to factors
col_list <- c("job", "marital","education", "default", "housing","loan",
"contact","month", "poutcome","y")
for (col in col_list) {
df_bank[[col]] <- as.factor(df_bank[[col]])
}
Correlations
num_list <- df_bank[,c(1,6,10,12,13,14,15)]
library(corrplot)
C <- cor(num_list)
corrplot(C, method="circle")
We can see that pdays and previous have a high correlation. This suggests that at the time of model building, it is best to include either one of them.
Distribution of the variables
#Job
library(ggplot2)
AA <- ggplot(df_bank, aes(job))
AA <- AA + geom_histogram(stat="count") + labs(title = "Job")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
AA
table(df_bank$job)
##
## admin. blue-collar entrepreneur housemaid management
## 478 946 168 112 969
## retired self-employed services student technician
## 230 183 417 84 768
## unemployed unknown
## 128 38
round((prop.table(table(df_bank$job))*100),1)
##
## admin. blue-collar entrepreneur housemaid management
## 10.6 20.9 3.7 2.5 21.4
## retired self-employed services student technician
## 5.1 4.0 9.2 1.9 17.0
## unemployed unknown
## 2.8 0.8
Approximately 60% of the jobs fall into just 3 categories in our dataset: management, blue-collar and technician, suggesting that the variable is not that well-balanced across all levels of the job.
#Marital
BB <- ggplot(df_bank, aes(marital))
BB <- BB + geom_histogram(stat="count") + labs(title = "Marital")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
BB
table(df_bank$marital)
##
## divorced married single
## 528 2797 1196
round((prop.table(table(df_bank$marital))*100),1)
##
## divorced married single
## 11.7 61.9 26.5
Majority individuals are married. This variable is relatively well balanced if we group single with divorced.
#Default
CC <- ggplot(df_bank, aes(default))
CC <- CC + geom_histogram(stat="count") + labs(title = "Default")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
CC
table(df_bank$default)
##
## no yes
## 4445 76
round((prop.table(table(df_bank$default))*100),1)
##
## no yes
## 98.3 1.7
There is only 1.7% of the individuals who have defaulted. The variable’s predictive power would be very less, given that the distribution is highly unbalanced.
#Education
DD <- ggplot(df_bank, aes(education))
DD <- DD + geom_histogram(stat="count") + labs(title = "Education")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
DD
table(df_bank$education)
##
## primary secondary tertiary unknown
## 678 2306 1350 187
round((prop.table(table(df_bank$education))*100),1)
##
## primary secondary tertiary unknown
## 15.0 51.0 29.9 4.1
51% of the individuals have a secondary education. There are some unknown observations but they are very less (4.1%) and probably they should not hamper the modelling results.
#Loan
EE <- ggplot(df_bank, aes(loan))
EE <- EE + geom_histogram(stat="count") + labs(title = "Loan")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
EE
table(df_bank$loan)
##
## no yes
## 3830 691
round((prop.table(table(df_bank$loan))*100),1)
##
## no yes
## 84.7 15.3
Approximately 85% of the individuals did not have a personal loan.
#Housing
FF <- ggplot(df_bank, aes(housing))
FF <- FF + geom_histogram(stat="count") + labs(title = "Housing Loan")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
FF
table(df_bank$housing)
##
## no yes
## 1962 2559
round((prop.table(table(df_bank$housing))*100),1)
##
## no yes
## 43.4 56.6
The distribution here is very well balanced between the individuals who have a housing loan vs. who have no housing loan.
#Contact
GG <- ggplot(df_bank, aes(contact))
GG <- GG + geom_histogram(stat="count") + labs(title = "Contact")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
GG
table(df_bank$contact)
##
## cellular telephone unknown
## 2896 301 1324
round((prop.table(table(df_bank$contact))*100),1)
##
## cellular telephone unknown
## 64.1 6.7 29.3
64.1% of the individuals were contacted via cellular means whereas only approx. 7% were contacted via telephone with 30% unknown data.
#Month
HH <- ggplot(df_bank, aes(month))
HH <- HH + geom_histogram(stat="count") + labs(title = "Month")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
HH
table(df_bank$month)
##
## apr aug dec feb jan jul jun mar may nov oct sep
## 293 633 20 222 148 706 531 49 1398 389 80 52
round((prop.table(table(df_bank$month))*100),1)
##
## apr aug dec feb jan jul jun mar may nov oct sep
## 6.5 14.0 0.4 4.9 3.3 15.6 11.7 1.1 30.9 8.6 1.8 1.2
Majority of the individuals were contacted mostly during the summer months with a sizable portion in the month of May (30.9%). It might show that contacting people during certain months can have a favorable or unfavorable impact on the bank’s marketing campaign to subscribe to term deposits.
#Poutcome
II <- ggplot(df_bank, aes(poutcome))
II <- II + geom_histogram(stat="count") + labs(title = "Poutcome")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
II
table(df_bank$poutcome)
##
## failure other success unknown
## 490 197 129 3705
round((prop.table(table(df_bank$poutcome))*100),1)
##
## failure other success unknown
## 10.8 4.4 2.9 82.0
The value is unknown for 82% of the individuals. We could treat these unknowns as missing data (NA), however, these unknowns could be the individuals which the bank has not contacted yet. Therefore, first it is important to check the relation of this variable with our target variable (y) to see how many individuals from this unknown category subscribe to the term deposit.
#distribution of the response variable y
JJ <- ggplot(df_bank, aes(y))
JJ <- JJ + geom_histogram(stat="count") + labs(title = "Response Variable: y")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
JJ
table(df_bank$y)
##
## no yes
## 4000 521
round((prop.table(table(df_bank$y))*100),1)
##
## no yes
## 88.5 11.5
Only around 11.5% of respondents to the current campaign have subscribed to the bank’s term deposit. This suggests that the dataset is unbalanced.
#Age
KK <- ggplot(df_bank, aes(age))
KK <- KK + geom_histogram(stat="count") + labs(title = "Age")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
KK
The distribution here seems to be fairly balanced where the majority individuals are aged between 25 to 50 years old.
#Balance
library(ggplot2)
LL <- ggplot(df_bank, aes(balance))
LL <- LL + geom_histogram(stat="count") + labs(title = "Balance") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlim(c(0,20000)) + ylim(c(0,25))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
LL
## Warning: Removed 386 rows containing non-finite values (`stat_count()`).
## Warning: Removed 1 rows containing missing values (`geom_bar()`).
median(df_bank$balance)
## [1] 444
The distribution here is very skewed to the right as we can see. We have very high values as well which suggests we will have to deal with outliers. Compared to all the values, the median here is only 444 which is close to zero which suggests that most of the individuals contacted have close to zero balance.
#Campaign
MM <- ggplot(df_bank, aes(campaign))
MM <- MM + geom_histogram(stat="count") + labs(title = "Campaign")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
MM
The distribution here is again skewed to the right with majority of the individuals being contacted only once or twice during this campaign.
#Day
NN <- ggplot(df_bank, aes(day))
NN <- NN + geom_histogram(stat="count") + labs(title = "Day")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
NN
mean(df_bank$day)
## [1] 15.91528
median(df_bank$day)
## [1] 16
The distribution here is very uniform with mean of approx. 15.9 almost equal to median of 16.
#Duration
OO <- ggplot(df_bank, aes(duration))
OO <- OO + geom_histogram(stat="count") + labs(title = "Duration")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
OO
median(df_bank$duration)
## [1] 185
We have some big values here with a right skewed distribution. The median here is 185 seconds (close to 3 minutes), which suggests that most individuals decide about subscrbing to a term deposit or not within the first 3 to 4 minutes of the call duration.
#Pdays
PP <- ggplot(df_bank, aes(pdays))
PP <- PP + geom_histogram(stat="count") + labs(title = "Pdays")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
PP
median(df_bank$pdays)
## [1] -1
The distribution is extremely skewed with a median of perhaps -1. This value means those individuals who have never been contacted before this campaign, that is, most of the individuals were contacted for the very first time during this campaign.
#Previous
QQ <- ggplot(df_bank, aes(previous))
QQ <- QQ + geom_histogram(stat="count") + labs(title = "Previous")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
QQ
median(df_bank$previous)
## [1] 0
This corroborates with the previous graph (pdays) which suggests that there was no communication before with the individuals contacted during this campaign (Median = 0).
Analysis of the independent variables in relation to the response variable
# Job and y
library(ggplot2)
A <- ggplot(df_bank, aes(job,fill = y))
A <- A + geom_histogram(stat="count") + labs(title = "job and y") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
A
We can see that individuals with management job are the ones who have subscribed the maximum to a term deposit followed by technicians.
#Marital and y
B <- ggplot(df_bank, aes(marital,fill = y))
B <- B + geom_histogram(stat="count") + labs(title = "marital and y") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
B
Married individuals have more tendency to subscribe to a term deposit.
#Education and y
C <- ggplot(df_bank, aes(education,fill = y))
C <- C + geom_histogram(stat="count") + labs(title = "education and y") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
C
Individuals with secondary education subscribe the most to a term deposit.
#Default and y
D <- ggplot(df_bank, aes(default,fill = y))
D <- D + geom_histogram(stat="count") + labs(title = "default and y") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
D
Individuals who have no credits in default subscribe to a term deposit whereas for individulas who have credits in default hardly subscribe.
#Housing and y
E <- ggplot(df_bank, aes(housing,fill = y))
E <- E + geom_histogram(stat="count") + labs(title = "housing and y") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
E
Individuals who do not have a housing loan tend to subsribe more to a term deposit compared to the individuals who have a housing loan.
#Loan and y
F <- ggplot(df_bank, aes(loan,fill = y))
F <- F + geom_histogram(stat="count") + labs(title = "loan and y") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
F
Individuals who do not have a personal loan subscribe more to a term deposit.
#Contact and y
G <- ggplot(df_bank, aes(contact,fill = y))
G <- G + geom_histogram(stat="count") + labs(title = "contact and y") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
G
When the contact communication type was Cellular, individuals subscribed more to the bank’s term deposit.
#poutcome and y
H <- ggplot(df_bank, aes(poutcome,fill = y))
H <- H + geom_histogram(stat="count") + labs(title = "poutcome and y") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
H
When the outcome of the previous marketing campaign was a success, individuals subscribe more to the term deposit. However, as mentioned before, the unknown category might represent the new individuals and they are the ones who subscribe the most to the term deposits. Therefore, it’s best to keep this category in for the time being we reach the modelling stage.
#age and y
I <- ggplot(df_bank, aes(age,fill = y))
I <- I + geom_histogram(stat="count") + labs(title = "age and y") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
I
Individuals of age 25-45 years old subscribe the most to term deposits. The same age group has also the highest counts of not subscribing to the term deposit. This suggests that individuals in this age range are contacted the most due to their high presence in our dataset.
#balance and y
library(dplyr)
y_yes <- df_bank %>% filter(df_bank$y =="yes")
y_no <- df_bank %>% filter(df_bank$y =="no")
J <- ggplot(y_yes, aes(balance))
J <- J + geom_histogram(stat="count", binwidth=10) + labs(title = "balance and y_yes") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlim(c(0,3000)) + ylim(c(0,25))
## Warning in geom_histogram(stat = "count", binwidth = 10): Ignoring unknown
## parameters: `binwidth`, `bins`, and `pad`
J
## Warning: Removed 120 rows containing non-finite values (`stat_count()`).
## Warning: Removed 1 rows containing missing values (`geom_bar()`).
K <- ggplot(y_no, aes(balance))
K <- K + geom_histogram(stat="count", binwidth=10) + labs(title = "balance and y_no") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlim(c(0,3000)) + ylim(c(0,25))
## Warning in geom_histogram(stat = "count", binwidth = 10): Ignoring unknown
## parameters: `binwidth`, `bins`, and `pad`
K
## Warning: Removed 865 rows containing non-finite values (`stat_count()`).
## Removed 1 rows containing missing values (`geom_bar()`).
Overall, these two graphs show that individuals who have very low balance don’t usually subscribe to a term deposit and very very few people with a low balance actually subscribe for the deposit.
#duration and y
L <- ggplot(df_bank, aes(duration,fill = y))
L <- L + geom_histogram(stat="count") + labs(title = "duration and y") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
xlim(c(0,1000))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
L
## Warning: Removed 107 rows containing non-finite values (`stat_count()`).
We can observe from the graph that the individuals who do not wish to subscribe for the term deposit are able to decide the same within the first few minutes of the call and those who wish to subscribe for the deposit usually take longer.
This variable has an interesting impact as it highly affects the the response variable. When the duration = 0, then y is always = no i.e. that is there is no term deposit subscription. Yet, the duration is not known before making the call. Also, after the end of the call, y is obviously known. Therefore, in order to have a realistic predictive model, this variable should be discarded from the analysis.
#month and y
M <- ggplot(df_bank, aes(month,fill = y))
M <- M + geom_histogram(stat="count") + labs(title = "month and y") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
M
In the months of May and August, individuals tend to subscribe more to term deposits.
We can similarly check the relation of other numerical variables such as previous, campaign, pdays with the response variable y.
Treatment of outliers
From the analysis done before, we saw that some of the variables have outliers and it is important to deal with them before we proceed with the actual modelling. A very common method of outliers which we can use here is capping where the outliers below the 5th percentile and above the 95th percentile are capped.
As observed, we can perform this capping on the following numeric variables: balance, campaign, pdays and previous.
library(dlookr)
df_bank$balance <- imputate_outlier(df_bank, balance, method = "capping")
df_bank$campaign <- imputate_outlier(df_bank, campaign, method = "capping")
df_bank$pdays <- imputate_outlier(df_bank, pdays, method = "capping")
df_bank$previous <- imputate_outlier(df_bank, previous, method = "capping")
Treatment of Categorical Variables
As seen before, for some of the categorical variables (Marital, Month and Job), the distribution could be improved to have a better predictive performance if we group some of the similar levels together into new broader levels.
library(rockchalk)
##
## Attaching package: 'rockchalk'
## The following objects are masked from 'package:dlookr':
##
## kurtosis, skewness
## The following object is masked from 'package:dplyr':
##
## summarize
df_bank$marital <- combineLevels(df_bank$marital, levs = c("single", "divorced"),
newLabel = "unmarried")
df_bank$month <- combineLevels(df_bank$month, levs = c("aug", "sep","oct"),
newLabel = "Autumn")
df_bank$month <- combineLevels(df_bank$month, levs = c("nov", "dec", "jan",
"feb", "mar"),
newLabel = "Winter")
df_bank$month <- combineLevels(df_bank$month, levs = c("apr","may", "jun", "jul"),
newLabel = "Summer")
df_bank$job <- combineLevels(df_bank$job, levs = c("unemployed",
"retired", "student"),
newLabel = "Not Employed")
df_bank$job <- combineLevels(df_bank$job, levs = c("admin.", "management"),
newLabel = "admin and mgnt")
df_bank$job <- combineLevels(df_bank$job, levs = c("blue-collar", "technician"),
newLabel = "blue-collar")
df_bank$job <- combineLevels(df_bank$job, levs = c("housemaid", "services"),
newLabel = "service class")
df_bank$job <- combineLevels(df_bank$job, levs = c("entrepreneur",
"self-employed"),
newLabel = "self-employed")
Brief Conclusion
Based on the exploratory data analysis (EDA) done, we have decided to remove two variables from the modelling stage: pdays (since it is highly correlated to the variable previous) and duration (since it highly affects the target variable i.e. if duration = 0 then y = no). Outliers have been dealt with. For the categorical variables, some of the similar levels have been grouped to improve their distribution. The response variable could have been treated as well by making it more balanced by using some advanced packages such as DMwR (function SMOTE), ROSE etc. as a part of advanced EDA but this has not been done here.
Part b)
Fit a regression model (as appropriate) using ‘y’ as the response variable. Model y in terms of age, job, marital and default. Provide the model summary results and write an equation for the fitted model on log-odds scale.
df_bank$y <- ifelse(df_bank$y == "yes", 1, 0)
mod1 <- glm(y ~ age + job + marital + default, data=df_bank,
family=binomial(link="logit"))
summary(mod1)
##
## Call:
## glm(formula = y ~ age + job + marital + default, family = binomial(link = "logit"),
## data = df_bank)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8884 -0.5242 -0.4589 -0.3969 2.3596
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.212279 0.476216 -4.646 3.39e-06 ***
## age 0.012848 0.004511 2.848 0.00439 **
## jobNot Employed -0.047775 0.437028 -0.109 0.91295
## jobadmin and mgnt -0.422589 0.428048 -0.987 0.32352
## jobblue-collar -0.816166 0.429320 -1.901 0.05729 .
## jobservice class -0.730414 0.445387 -1.640 0.10102
## jobself-employed -0.666107 0.456582 -1.459 0.14459
## maritalunmarried 0.467715 0.099489 4.701 2.59e-06 ***
## defaultyes 0.008619 0.361493 0.024 0.98098
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3231.0 on 4520 degrees of freedom
## Residual deviance: 3163.6 on 4512 degrees of freedom
## AIC: 3181.6
##
## Number of Fisher Scoring iterations: 5
exp(mod1$coefficients)
## (Intercept) age jobNot Employed jobadmin and mgnt
## 0.1094509 1.0129306 0.9533485 0.6553477
## jobblue-collar jobservice class jobself-employed maritalunmarried
## 0.4421236 0.4817098 0.5137046 1.5963421
## defaultyes
## 1.0086559
Fitted model equation on log-odds scale:
\(\hat{log(odds(y=1|age,job,marital,default))}\) = -2.212279 + 0.012848age - 0.047775jobNot Employed -0.422589jobadmin and mgnt -0.816166jobblue-collar -0.730414jobservice class -0.666107jobself-employed + 0.467715maritalunmarried + 0.008619defaultyes
Part c)
Based on the model in part b), interpret the regression coefficients for the variables age and default on an appropriate scale.
Interpretation of regression coefficient associated with the variable age
exp(\(\hat{\beta_1}\)) = 1.0129306. It represents the multiplicative effect of age on the odds ratio of subscribing to a term deposit, holding all other variables constant. It is estimated as exp(0.012848) = 1.0129306, i.e. for every one year increase in age the odds of subscribing to a term deposit increases by 1.3%, keeping all other variables unchanged.
Interpretation of regression coefficient associated with the variable default
exp(\(\hat{\beta_8}\)) = 1.0086559. It represents the odds ratio of subscribing to a term deposit for an individual who has defaulted in comparison to an individual who has not defaulted, keeping all other variables fixed. It is estimated as exp(0.008619) = 1.0086559 that is the odds of subscribing to a term deposit are 1.0086559 times higher for individuals who have defaulted compared to individuals who haven’t defaulted, holding all other variables unchanged.
Part d)
Based on the model in part b), what is the estimated odds that an unmarried self-employed individual who is 30 years old and has defaulted will subscribe for the term deposit? What is the estimated probability that a married blue-collared individual who is 50 years old and has not defaulted will subscribe for the term deposit?
Estimated odds
\(\hat{odds(y=1|Marital=unmarried, job=self-employed, age=30, default=yes)}\) = exp(\(\hat{\beta_0}\) + \(\hat{\beta_1}\)*30 + \(\hat{\beta_6}\) + \(\hat{\beta_8}\))
= exp(-2.212279 + 0.012848*30 - 0.666107 + 0.008619)
= exp(-2.484327)
= 0.08338165
Therefore, the estimated odds that an unmarried self-employed individual who is 30 years old and has defaulted will subscribe for the term deposit is 0.08338165.
Estimated Probability
\(\hat{P(y=1|Marital=married, job=blue-collar, age=50, default=no)}\) = \(\frac{exp(\hat\beta_0+\hat\beta_1*50+\hat\beta_4)}{1+exp(\hat\beta_0+\hat\beta_1*50+\hat\beta_4)}\)
= \(\frac{exp(-2.212279+0.012848*50-0.816166)}{1+exp(-2.212279+0.012848*50-0.816166)}\)
= \(\frac{exp(-2.386045)}{1+exp(-2.386045)}\)
= = \(\frac{0.0919928}{1+0.0919928}\)
= 0.08424303
Therefore, the estimated probability that a married blue-collared individual who is 50 years old and has not defaulted will subscribe for the term deposit is 0.08424303.
Part e)
Fit another model which includes the variables age, default, balance, marital, job, campaign and an interaction between balance and marital. Provide the model summary results as obtained in R and write an equation for the fitted model on odds scale.
mod2 <- glm(y ~ age+ default + balance + marital + job + campaign + balance*marital, data=df_bank,
family=binomial(link="logit"))
summary(mod2)
##
## Call:
## glm(formula = y ~ age + default + balance + marital + job + campaign +
## balance * marital, family = binomial(link = "logit"), data = df_bank)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9731 -0.5306 -0.4500 -0.3815 2.5995
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.025e+00 4.837e-01 -4.187 2.83e-05 ***
## age 1.188e-02 4.539e-03 2.617 0.00886 **
## defaultyes 1.206e-01 3.646e-01 0.331 0.74072
## balance 8.794e-05 2.978e-05 2.953 0.00314 **
## maritalunmarried 4.951e-01 1.208e-01 4.100 4.14e-05 ***
## jobNot Employed -7.728e-02 4.396e-01 -0.176 0.86047
## jobadmin and mgnt -4.187e-01 4.306e-01 -0.972 0.33090
## jobblue-collar -7.980e-01 4.319e-01 -1.848 0.06463 .
## jobservice class -7.206e-01 4.480e-01 -1.609 0.10771
## jobself-employed -6.576e-01 4.591e-01 -1.432 0.15206
## campaign -1.149e-01 2.775e-02 -4.141 3.46e-05 ***
## balance:maritalunmarried -2.422e-05 4.562e-05 -0.531 0.59551
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3231.0 on 4520 degrees of freedom
## Residual deviance: 3133.1 on 4509 degrees of freedom
## AIC: 3157.1
##
## Number of Fisher Scoring iterations: 5
exp(mod2$coefficients)
## (Intercept) age defaultyes
## 0.1319797 1.0119515 1.1282260
## balance maritalunmarried jobNot Employed
## 1.0000879 1.6407274 0.9256338
## jobadmin and mgnt jobblue-collar jobservice class
## 0.6579124 0.4502142 0.4864597
## jobself-employed campaign balance:maritalunmarried
## 0.5180970 0.8914525 0.9999758
Fitted model equation on odds scale:
\(\hat{odds(y=1|balance,marital,job,campaign)}\) = exp(-2.025e+00 + 1.188e-02age + 1.206e-01defaultyes + 8.794e-05balance + 4.951e-01maritalunmarried - 7.728e-02jobNot Employed - 4.187e-01jobadmin and mgnt - 7.980e-01jobblue-collar - 7.206e-01jobservice class - 6.576e-01jobself-employed - 1.149e-01campaign - 2.422e-05balance*maritalunmarried)
Part f)
Based on the model in part e), interpret the regression coefficients with the main effects of variables balance and marital on an appropriate scale.
Interpretation of regression coefficient associated with the main effect of variable balance
exp(\(\hat{\beta_3}\)) = 1.0000879. For every one dollar increase in the balance for a married individual, the odds of subscribing to a term deposit is multiplied by a factor of exp(8.794e-05) = 1.0000879, when all other variables remain unchanged.
Interpretation of regression coefficient associated with the main effect of variable marital
exp(\(\hat{\beta_4}\)) = 1.6407274. It represents the odds ratio of subscribing to a term deposit which is multiplied by a factor of exp(4.951e-01) = 1.6407274 for an unmarried individual who has a balance of 0 in his/her bank account, keeping all other variables fixed.
Part g)
What is the estimated odds ratio for a 25 year old married service class individual who has a balance of 1000, has been contacted 3 times during this campaign and has defaulted vs. a 40 year old unmarried blue-collar individual who has a balance of 5000 dollars, has been contacted once during this campaign and has not defaulted?
Estimated odds ratio is calculated as follows:
\(\frac{odds(y=1|age=25, default=yes, balance=1000, marital=married, job=service class, campagin=3)}{odds(y=1|age=40, default=no, balance=5000, marital=unmarried, job=blue-collar, campagin=1)}\)
= \(\frac{exp(\hat\beta_0+\hat\beta_1*25+\hat\beta_2+\hat\beta_3*1000+\hat\beta_8+\hat\beta_{10}*3)}{exp(\hat\beta_0+\hat\beta_1*40+\hat\beta_3*5000+\hat\beta_4+\hat\beta_7+\hat\beta_{10}*1+\hat\beta_{11}*5000*1)}\)
= \(\frac{exp(-2.025e+00+1.188e-02*25+1.206e-01+8.794e-05*1000-7.206e-01-1.149e-01*3)}{exp(-2.025e+00+1.188e-02*40+8.794e-05*5000+4.951e-01-7.980e-01-1.149e-01*1-2.422e-05*5000*1)}\)
= \(\frac{0.07541418}{0.1922421}\)
= 0.3922875
Therefore, the estimated odds ratio for a 25 year old married service class individual who has a balance of 1000, has been contacted 3 times during this campaign and has defaulted vs. a 40 year old unmarried blue-collar individual who has a balance of 5000 dollars, has been contacted once during this campaign and has not defaulted is 0.3922875.
Part h)
Formally compare the models in parts b) and e) using analysis of deviance. What is your conclusion? Which model will you select on the basis of AIC and BIC?
anova(mod1,mod2,test="Chisq")
## Analysis of Deviance Table
##
## Model 1: y ~ age + job + marital + default
## Model 2: y ~ age + default + balance + marital + job + campaign + balance *
## marital
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 4512 3163.6
## 2 4509 3133.1 3 30.508 1.079e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod1$aic
## [1] 3181.603
mod2$aic
## [1] 3157.095
BIC(mod1)
## [1] 3239.352
BIC(mod2)
## [1] 3234.093
From the output above, our simplified model is Model 1 which includes 4 variables: age, job, marital and default. We want to test if Model 1 is an adequate simplification of the complete model (Model 2) which apart from these variables also includes balance, campaign and an interaction between balance and marital.
Hypothesis is defined as follows:
\({H_0}\) : \({\beta_3}\) = \({\beta_{10}}\) = \({\beta_{11}}\) = 0
\({H_1}\) : at least one of \({\beta_3}\), \({\beta_{10}}\), \({\beta_{11}}\) \(\neq\) 0
Deviance = 30.508
p-value = 1.079e-06
Since the p-value is much lower than any reasonable \(\alpha\), we can reject the null hypothesis and conclude that Model 1 is NOT an adequate simplification of the complete model (Model 2).
Since both AIC (3157.095 < 3181.603) and BIC (3234.093 < 3239.352) values are lower for Model 2, therefore Model 2 should be selected.